####################################################################
#
#
# File: latin_1.def
#
# Personal Library Software, July, 1993
# Tom Donaldson
#
# Function: Tokenizer definitional data for table driven tokenizer.
# This file defines a basic isalnum() tokenization for the 8-bit
# LATIN1 character set (upper 128 values are "European" characters).
#
# The CplTabledRomanceTokenizer allows customization of tokenization by
# editing rules that define the operation of the tokenizer. The central
# concept is the "word continuation" rule, which defines the kinds of
# characters that CANNOT be split from each other.
#
# History
# -------
#
# 31aug93 tomd Created from ctypes.def
#
####################################################################
####################################################################
#
# Installation
# ============
#
# Database.def File
# -----------------
#
# To use the CplTabledRomanceTokenizer, you need this line in the .def
# file for the database:
#
# TOKENIZER = CplTabledRomanceTokenizer
#
#
# Tokenizer File
# --------------
#
# This file, latin_1.def, is the rule file. The tokenizer REQUIRES that
# its definition file be named "tknztbld.def". Therefore, you MUST copy
# this file to "tknztbld.def". The "tknztbld.def" file MUST be in the
# "home directory" of the database using the tokenizer, or in the
# "system" directory of the CPL installation.
#
# Note that a tknztbld.def in the database's home directory takes
# precedence over a tknztbld.def in the CPL "system" directory.
#
#
####################################################################
####################################################################
#
# Section 1: Character Class Definitions
#
####################################################################
# The only rule needed for this C-type isalnum() style of tokenization
# is a "letter" rule. All characters that can take part in a token must
# be classified as a "letter". Such "letter" characters are always
# included in tokens and are never separated from each other.
# Name
# ----
Letter
EndRule
####################################################################
#
# Section 2: Character Classification Map
#
####################################################################
# -------  ------  -----------------------
# Decimal  Class
# Value    Name    Comment
# -------  ------  -----------------------
# Digits: Note that they are classified as Letter, which is the only
# character class defined.
48 Letter # Char '0'
49 Letter # Char '1'
50 Letter # Char '2'
51 Letter # Char '3'
52 Letter # Char '4'
53 Letter # Char '5'
54 Letter # Char '6'
55 Letter # Char '7'
56 Letter # Char '8'
57 Letter # Char '9'
# Upper case letters:
65 Letter # Char 'A'
66 Letter # Char 'B'
67 Letter # Char 'C'
68 Letter # Char 'D'
69 Letter # Char 'E'
70 Letter # Char 'F'
71 Letter # Char 'G'
72 Letter # Char 'H'
73 Letter # Char 'I'
74 Letter # Char 'J'
75 Letter # Char 'K'
76 Letter # Char 'L'
77 Letter # Char 'M'
78 Letter # Char 'N'
79 Letter # Char 'O'
80 Letter # Char 'P'
81 Letter # Char 'Q'
82 Letter # Char 'R'
83 Letter # Char 'S'
84 Letter # Char 'T'
85 Letter # Char 'U'
86 Letter # Char 'V'
87 Letter # Char 'W'
88 Letter # Char 'X'
89 Letter # Char 'Y'
90 Letter # Char 'Z'
# Lower case letters:
97 Letter # Char 'a'
98 Letter # Char 'b'
99 Letter # Char 'c'
100 Letter # Char 'd'
101 Letter # Char 'e'
102 Letter # Char 'f'
103 Letter # Char 'g'
104 Letter # Char 'h'
105 Letter # Char 'i'
106 Letter # Char 'j'
107 Letter # Char 'k'
108 Letter # Char 'l'
109 Letter # Char 'm'
110 Letter # Char 'n'
111 Letter # Char 'o'
112 Letter # Char 'p'
113 Letter # Char 'q'
114 Letter # Char 'r'
115 Letter # Char 's'
116 Letter # Char 't'
117 Letter # Char 'u'
118 Letter # Char 'v'
119 Letter # Char 'w'
120 Letter # Char 'x'
121 Letter # Char 'y'
122 Letter # Char 'z'
# LATIN1 characters.
#
# Comment column is the Unicode name for the character.
#
# Removed C1 control codes.
#
# Latin-1 begins here. The non-breaking space character is classified
# as a letter.
160 Letter # NON-BREAKING SPACE
# Uppercase letters
192 Letter # LATIN CAPITAL LETTER A GRAVE
193 Letter # LATIN CAPITAL LETTER A ACUTE
194 Letter # LATIN CAPITAL LETTER A CIRCUMFLEX
195 Letter # LATIN CAPITAL LETTER A TILDE
196 Letter # LATIN CAPITAL LETTER A DIAERESIS
197 Letter # LATIN CAPITAL LETTER A RING
198 Letter # LATIN CAPITAL LETTER A E
199 Letter # LATIN CAPITAL LETTER C CEDILLA
200 Letter # LATIN CAPITAL LETTER E GRAVE
201 Letter # LATIN CAPITAL LETTER E ACUTE
202 Letter # LATIN CAPITAL LETTER E CIRCUMFLEX
203 Letter # LATIN CAPITAL LETTER E DIAERESIS
204 Letter # LATIN CAPITAL LETTER I GRAVE
205 Letter # LATIN CAPITAL LETTER I ACUTE
206 Letter # LATIN CAPITAL LETTER I CIRCUMFLEX
207 Letter # LATIN CAPITAL LETTER I DIAERESIS
208 Letter # LATIN CAPITAL LETTER ETH
209 Letter # LATIN CAPITAL LETTER N TILDE
210 Letter # LATIN CAPITAL LETTER O GRAVE
211 Letter # LATIN CAPITAL LETTER O ACUTE
212 Letter # LATIN CAPITAL LETTER O CIRCUMFLEX
213 Letter # LATIN CAPITAL LETTER O TILDE
214 Letter # LATIN CAPITAL LETTER O DIAERESIS
# Removed multiplication sign
216 Letter # LATIN CAPITAL LETTER O SLASH
217 Letter # LATIN CAPITAL LETTER U GRAVE
218 Letter # LATIN CAPITAL LETTER U ACUTE
219 Letter # LATIN CAPITAL LETTER U CIRCUMFLEX
220 Letter # LATIN CAPITAL LETTER U DIAERESIS
221 Letter # LATIN CAPITAL LETTER Y ACUTE
222 Letter # LATIN CAPITAL LETTER THORN
# Lowercase letters
223 Letter # LATIN SMALL LETTER SHARP S
224 Letter # LATIN SMALL LETTER A GRAVE
225 Letter # LATIN SMALL LETTER A ACUTE
226 Letter # LATIN SMALL LETTER A CIRCUMFLEX
227 Letter # LATIN SMALL LETTER A TILDE
228 Letter # LATIN SMALL LETTER A DIAERESIS
229 Letter # LATIN SMALL LETTER A RING
230 Letter # LATIN SMALL LETTER A E
231 Letter # LATIN SMALL LETTER C CEDILLA
232 Letter # LATIN SMALL LETTER E GRAVE
233 Letter # LATIN SMALL LETTER E ACUTE
234 Letter # LATIN SMALL LETTER E CIRCUMFLEX
235 Letter # LATIN SMALL LETTER E DIAERESIS
236 Letter # LATIN SMALL LETTER I GRAVE
237 Letter # LATIN SMALL LETTER I ACUTE
238 Letter # LATIN SMALL LETTER I CIRCUMFLEX
239 Letter # LATIN SMALL LETTER I DIAERESIS
240 Letter # LATIN SMALL LETTER ETH
241 Letter # LATIN SMALL LETTER N TILDE
242 Letter # LATIN SMALL LETTER O GRAVE
243 Letter # LATIN SMALL LETTER O ACUTE
244 Letter # LATIN SMALL LETTER O CIRCUMFLEX
245 Letter # LATIN SMALL LETTER O TILDE
246 Letter # LATIN SMALL LETTER O DIAERESIS
# Removed division sign
248 Letter # LATIN SMALL LETTER O SLASH
249 Letter # LATIN SMALL LETTER U GRAVE
250 Letter # LATIN SMALL LETTER U ACUTE
251 Letter # LATIN SMALL LETTER U CIRCUMFLEX
252 Letter # LATIN SMALL LETTER U DIAERESIS
253 Letter # LATIN SMALL LETTER Y ACUTE
254 Letter # LATIN SMALL LETTER THORN
255 Letter # LATIN SMALL LETTER Y DIAERESIS
# --- ----- -----------------------
-1 EndOfDefs # Not loaded. Just marks end of map definition.
# --- ----- -----------------------
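#
# Example (hypothetical, not part of the shipped rules): because only
# characters classified as Letter can take part in a token, an
# unlisted character such as the apostrophe (decimal 39) or the hyphen
# (decimal 45) presumably ends the current token. To keep such a
# character inside tokens, an entry in the same three-column form
# would be added above the "-1 EndOfDefs" marker, for instance:
#
#   39 Letter # Char "'" (hypothetical addition)
#   45 Letter # Char '-' (hypothetical addition)
#
# The lines are left commented out here so that the behavior of this
# file is unchanged.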
####################################################################
#
# Section 3: Word Continuation Rules
#
####################################################################
# There is only one rule. Letter characters cannot be separated from
# each other, ever, and only Letter characters can be in tokens.
Letter *
EndRule
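#
# Worked example (a sketch derived from the rules above, assuming that
# every character without a Letter entry in Section 2 acts as a token
# separator, and that the Section 4 canonization is applied to each
# token):
#
#   Input text:  Ruskin's "Modern Painters", vol. 2
#   Tokens:      RUSKIN  S  MODERN  PAINTERS  VOL  2
#
# The apostrophe, quotation marks, comma, period, and spaces are not
# classified as Letter, so each one ends the current token. The digit
# 2 is classified as Letter and therefore forms a token of its own.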
####################################################################
#
# Section 4: Canonization Map
#
####################################################################
# -------  -------  -----------
# Input    Output
# Decimal  Decimal
# Char     Char
# Value    Value    Comment
# -------  -------  -----------
#
# Map the characters a-z to the "canonical" characters A-Z. That is,
# all letters will be upper cased.
97 65 # Char 'a' canonizes to 'A'
98 66 # Char 'b' canonizes to 'B'
99 67 # Char 'c' canonizes to 'C'
100 68 # Char 'd' canonizes to 'D'
101 69 # Char 'e' canonizes to 'E'
102 70 # Char 'f' canonizes to 'F'
103 71 # Char 'g' canonizes to 'G'
104 72 # Char 'h' canonizes to 'H'
105 73 # Char 'i' canonizes to 'I'
106 74 # Char 'j' canonizes to 'J'
107 75 # Char 'k' canonizes to 'K'
108 76 # Char 'l' canonizes to 'L'
109 77 # Char 'm' canonizes to 'M'
110 78 # Char 'n' canonizes to 'N'
111 79 # Char 'o' canonizes to 'O'
112 80 # Char 'p' canonizes to 'P'
113 81 # Char 'q' canonizes to 'Q'
114 82 # Char 'r' canonizes to 'R'
115 83 # Char 's' canonizes to 'S'
116 84 # Char 't' canonizes to 'T'
117 85 # Char 'u' canonizes to 'U'
118 86 # Char 'v' canonizes to 'V'
119 87 # Char 'w' canonizes to 'W'
120 88 # Char 'x' canonizes to 'X'
121 89 # Char 'y' canonizes to 'Y'
122 90 # Char 'z' canonizes to 'Z'
# How to canonize this one?
# 223 223 # LATIN SMALL LETTER SHARP S --> canonize to what???
# Correct German uppercase is "SS".
# We must stick with 8-bits for now,
# so cannot do the "correct" thing.
224 192 # LATIN SMALL LETTER A GRAVE --> LATIN CAPITAL LETTER A GRAVE
225 193 # LATIN SMALL LETTER A ACUTE --> LATIN CAPITAL LETTER A ACUTE
226 194 # LATIN SMALL LETTER A CIRCUMFLEX --> LATIN CAPITAL LETTER A CIRCUMFLEX
227 195 # LATIN SMALL LETTER A TILDE --> LATIN CAPITAL LETTER A TILDE
228 196 # LATIN SMALL LETTER A DIAERESIS --> LATIN CAPITAL LETTER A DIAERESIS
229 197 # LATIN SMALL LETTER A RING --> LATIN CAPITAL LETTER A RING
230 198 # LATIN SMALL LETTER A E --> LATIN CAPITAL LETTER A E
231 199 # LATIN SMALL LETTER C CEDILLA --> LATIN CAPITAL LETTER C CEDILLA
232 200 # LATIN SMALL LETTER E GRAVE --> LATIN CAPITAL LETTER E GRAVE
233 201 # LATIN SMALL LETTER E ACUTE --> LATIN CAPITAL LETTER E ACUTE
234 202 # LATIN SMALL LETTER E CIRCUMFLEX --> LATIN CAPITAL LETTER E CIRCUMFLEX
235 203 # LATIN SMALL LETTER E DIAERESIS --> LATIN CAPITAL LETTER E DIAERESIS
236 204 # LATIN SMALL LETTER I GRAVE --> LATIN CAPITAL LETTER I GRAVE
237 205 # LATIN SMALL LETTER I ACUTE --> LATIN CAPITAL LETTER I ACUTE
238 206 # LATIN SMALL LETTER I CIRCUMFLEX --> LATIN CAPITAL LETTER I CIRCUMFLEX
239 207 # LATIN SMALL LETTER I DIAERESIS --> LATIN CAPITAL LETTER I DIAERESIS
240 208 # LATIN SMALL LETTER ETH --> LATIN CAPITAL LETTER ETH
241 209 # LATIN SMALL LETTER N TILDE --> LATIN CAPITAL LETTER N TILDE
242 210 # LATIN SMALL LETTER O GRAVE --> LATIN CAPITAL LETTER O GRAVE
243 211 # LATIN SMALL LETTER O ACUTE --> LATIN CAPITAL LETTER O ACUTE
244 212 # LATIN SMALL LETTER O CIRCUMFLEX --> LATIN CAPITAL LETTER O CIRCUMFLEX
245 213 # LATIN SMALL LETTER O TILDE --> LATIN CAPITAL LETTER O TILDE
246 214 # LATIN SMALL LETTER O DIAERESIS --> LATIN CAPITAL LETTER O DIAERESIS
248 216 # LATIN SMALL LETTER O SLASH --> LATIN CAPITAL LETTER O SLASH
249 217 # LATIN SMALL LETTER U GRAVE --> LATIN CAPITAL LETTER U GRAVE
250 218 # LATIN SMALL LETTER U ACUTE --> LATIN CAPITAL LETTER U ACUTE
251 219 # LATIN SMALL LETTER U CIRCUMFLEX --> LATIN CAPITAL LETTER U CIRCUMFLEX
252 220 # LATIN SMALL LETTER U DIAERESIS --> LATIN CAPITAL LETTER U DIAERESIS
253 221 # LATIN SMALL LETTER Y ACUTE --> LATIN CAPITAL LETTER Y ACUTE
254 222 # LATIN SMALL LETTER THORN --> LATIN CAPITAL LETTER THORN
# How to canonize this one?
# 255 255 # LATIN SMALL LETTER Y DIAERESIS --> canonize to what?????
# Latin-1 has no capital Y DIAERESIS, so no 8-bit
# uppercase form exists.
# --- ----- -----------------------
-1 -1 # Not loaded. Just marks end of map definition.
# --- ----- -----------------------
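#
# Worked example (a sketch, assuming that characters with no entry in
# this map pass through unchanged):
#
#   Input:   f    a   c-cedilla  a   d    e
#   Values:  102  97  231        97  100  101
#   Output:  70   65  199        65  68   69
#   Result:  F    A   C-CEDILLA  A   D    E
#
# Under the same assumption, SHARP S (223) and Y DIAERESIS (255) keep
# their own values, since their entries above are commented out.
#
# A site that prefers to fold Y DIAERESIS onto plain 'Y' could add an
# entry of the same form above the "-1 -1" marker. The line below is
# hypothetical and is left commented out:
#
#   255 89 # LATIN SMALL LETTER Y DIAERESIS --> 'Y' (hypothetical)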
####################################################################
#
#
# End Of File: latin_1.def
#
#
####################################################################